1 What is Machine Learning?

When we pit experts against minimalist performance benchmarks—dilettantes, dart-throwing chimps, and assorted extrapolation algorithms—we find few signs that expertise translates into greater ability to make either “well calibrated” or “discriminating” forecasts.

Philip E. Tetlock, Expert Political Judgment

Machine Learning (ML) is the process of extracting information from data. It’s an inductive process: turning a small amount of data into a large amount of knowledge. It’s a “knowledge lever” (Domingos 2012).

The field of Machine Learning contains contributions from Computer Science, Artificial Intelligence, Mathematics and Statistics. It would seem to be grounded in hard sciences. However, there is perhaps as much Art as there is Science in successful Machine Learning.

There are multiple objectives for Machine Learning. The most important ones are

  • understand the process that generated the data; and
  • generate predictions for new data.

In the first case the result would be an explanatory model, while in the second case it would be a predictive model. In both cases the Machine Learning algorithm constructs a model from the data. The model is not constructed on the basis of any theoretical principles and it does not have any explicitly programmed rules.

In an abstract sense a Machine Learning algorithm will try to infer a function which maps from an input space to an output space given a finite set of samples from that function. This can be a tough problem.

… guessing a parent function based on only a finite number of samples of it is an ill-posed problem in the Hadamard sense.

Wolpert (1992)

1.1 Motivating Example

Consider the data in the plot below.

This is what a sample of those data looks like.

head(known)
           x           y colour
1 -0.7034740  0.52113670    red
2 -0.3836283 -1.21542643   blue
3  0.2185601 -0.02221164    red
4  1.2246234 -0.11534448   blue
5 -0.8949542 -0.37435041   blue
6  1.1951691  1.47329766   blue
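Although the process that generated these data isn't shown, data like these could plausibly be simulated along the following lines (a sketch only: the range, sample size and clean circular boundary are assumptions, and the sample above suggests some label noise near the boundary which this sketch ignores).

```r
set.seed(13)

# Sample 100 points uniformly on the square [-1.5, 1.5] x [-1.5, 1.5].
known <- data.frame(
  x = runif(100, -1.5, 1.5),
  y = runif(100, -1.5, 1.5)
)

# Label points inside the unit circle red and the rest blue
# (the real data appear to have some noise near the boundary).
known$colour <- ifelse(known$x^2 + known$y^2 <= 1, "red", "blue")

head(known)
```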

How could we teach a machine to find the appropriate colour for further points? For example, suppose that we had the following additional data points:

unknown
           x          y
1 -0.3336836  0.5153916
2  0.6034826  0.8970194
3  1.0837656 -0.1358888
4  0.4330249  0.4415875
5  0.5033928 -0.9304380
6  0.8690635  0.5904288

How would we know whether their colour should be red or blue?

1.1.1 Naive Solution: Cartesian Coordinates

We could use some simple logic to synthesise a rule based on Cartesian coordinates.

with(unknown, ifelse(x >= -1 & x <= +1 & y >= -1 & y <= +1, "red", "blue"))
[1] "red"  "red"  "blue" "red"  "red"  "red" 

This is not completely unreasonable, although it is far from optimal. The rule effectively fits a square around the circle of red points, assigning red to points inside the square and blue to points outside it.

Could a computer learn this rule by itself? Yes, it could, and you’ll see how it’s done.
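To give a flavour of how, here's a minimal sketch using a decision tree (one of several algorithms that would work; the data are simulated here since the original `known` data frame isn't reproduced, and a clean circular boundary is assumed). The tree learns axis-aligned splits on x and y, effectively boxing in the red region much like the hand-crafted rule above.

```r
library(rpart)

set.seed(13)

# Simulate data resembling the example: red inside the unit circle.
train <- data.frame(x = runif(500, -1.5, 1.5), y = runif(500, -1.5, 1.5))
train$colour <- factor(ifelse(train$x^2 + train$y^2 <= 1, "red", "blue"))

# Fit a classification tree; no rule is explicitly programmed.
tree <- rpart(colour ~ x + y, data = train)

# Predict the colour of a central point and a far-out point.
predict(tree, data.frame(x = c(0, 1.4), y = c(0, 1.4)), type = "class")
```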

1.1.2 Improved Solution: Polar Coordinates

It seems obvious that if we introduce a radial coordinate then we could do a lot better.

library(dplyr)

unknown <- mutate(unknown, r = sqrt(x^2 + y^2))

A simpler logical rule now produces an even better result.

with(unknown, ifelse(r <= 1, "red", "blue"))
[1] "red"  "blue" "blue" "red"  "blue" "blue"

This process (adding a new variable) is called Feature Engineering and I think that this is a good demonstration of its power. Engineering appropriate features can make a massive difference to the performance of Machine Learning models.

ML: Teaching a chunk of silicon to reason like a slab of meat.

DataWookie

1.2 Some Terminology

Like any other specialised field, Machine Learning has a collection of terms which have a domain-specific meaning.

1.2.1 Features

A tidy data set consists of a collection of records (or rows), each of which has a number of fields (or columns). In Machine Learning the fields are called “features”.

The quality of the features (their information content) is the most important factor determining model success. A selection of features which are well correlated with the required outcome will make modelling a breeze.

It’s commonly the case that the raw data does not contain features which are amenable to modelling. In this case you’ll have to engineer new features by transforming and combining the existing features. Beyond cleaning and wrangling your data, a lot of time will also be spent on feature engineering. Whereas Machine Learning algorithms are fairly generic and can be applied to a wide range of problems, feature engineering is more domain specific. It’s an opportunity to be creative. Domain knowledge, intuition and a healthy dose of magic and trickery can be helpful.

If you have selected an appropriate set of features then adding more data will generally improve the quality of your model. If, however, you don’t have appropriate features then, no matter how much data you throw at the problem, your model will always be disappointing.

The quality of your model without the right features.

1.2.2 Training and Generalisation

The ML process consists of three stages:

  • Training (or Learning): using data to build a model;
  • Testing (or Validation): assessing whether the model is effective; and
  • Predicting: applying the model to novel, unseen data.

The prediction stage is also known as Generalisation. A Machine Learning algorithm learns from a finite (although potentially enormous) set of examples. In the process it generates a model of the patterns in the data. If the training has been successful then the resulting model will be able to make predictions for a far wider (essentially infinite) set of data, covering the entire input space. In essence the model will smoothly interpolate between the examples in the training data.

Although it’s generally possible to achieve reasonable accuracy in training, generalisation accuracy is much more important (and not necessarily that easy to achieve!).

In order to assess how well a model generalises to new data we split the data into two sets: training and testing. The model is trained on the former and evaluated on the latter. Making this split is critical because the error estimates that one obtains from the model applied to the training data are misleading. The model has been trained on those data, so of course it’s going to do well! It’s only when we apply the model to fresh (“unseen” or “holdout”) data that we get a true indication of the model’s worth.
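A common way to make this split in R is to sample row indices at random (a minimal sketch; the 80/20 split is a typical but arbitrary choice, and the data here are simulated for illustration).

```r
set.seed(13)

data <- data.frame(x = runif(100), y = runif(100))

# Randomly choose 80% of the row indices for training.
index <- sample(nrow(data), size = 0.8 * nrow(data))

# Train on those rows; hold out the rest for testing.
train <- data[index, ]
test  <- data[-index, ]

c(nrow(train), nrow(test))
```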

ML aims to extract information from training data and then use this information to make predictions on unseen data. The capacity of an ML algorithm to generalise is critical. Regardless of how much training data we have, we’re unlikely to see exactly the same examples again in new data. Consider this rather extreme example: a binary (two class) problem with 20 binary features and 10000 training records. That volume of data might seem ample for a binary problem, but the features can take on 2^20 = 1048576 distinct combinations, so most of the feature space is unsampled:

2^20 - 10000
[1] 1038576

The above example illustrates one of the major challenges with learning in a high dimensional feature space: as the dimension increases, the proportion of the space for which we have data rapidly declines.
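Repeating the calculation for increasing numbers of binary features makes this decline concrete (assuming, as above, 10000 training records; even with no duplicates, 10000 records can cover at most 10000 points of the feature space).

```r
features <- c(10, 15, 20, 25, 30)

# Best-case fraction of the binary feature space covered by 10000
# distinct records: it collapses rapidly as dimension grows.
coverage <- pmin(1, 10000 / 2^features)

data.frame(features, coverage)
```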

1.2.3 Accuracy and Precision

Ideally a ML model should achieve high accuracy and precision. But what do these terms mean? Consider the schematic illustration below.

Illustration of Bias and Variance.

A model with high precision (or low variance) when presented with similar input data will produce results which are tightly clustered together. By contrast, a model with low precision (or high variance) will generate wildly different predictions for similar input data. High variance can be caused by noise in the data but can also be a result of overfitting. More powerful models are increasingly likely to suffer from overfitting.

A model with high accuracy (or low bias) will produce predictions which are close to the “correct” result. Conversely, a model with low accuracy (or high bias) will make predictions which are consistently far from the “correct” result.

Simple models and a lot of data trump more elaborate models based on less data.

Peter Norvig

Every model will suffer from bias and variance to some degree. We should aim to minimise both. A simple model trained with lots of data will probably yield better results than a complicated model with less data. It’s generally a good idea to try a simple model first before progressing to something more sophisticated.
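The trade-off can be seen in a small simulation (a sketch with simulated data; the polynomial degrees are arbitrary choices standing in for "simple" and "complicated" models). The flexible model always fits the training data more closely, but that advantage need not carry over to the test data.

```r
set.seed(13)

# Noisy samples from a smooth underlying function.
x <- runif(100, -3, 3)
y <- sin(x) + rnorm(100, sd = 0.3)

train <- data.frame(x = x[1:70],  y = y[1:70])
test  <- data.frame(x = x[71:100], y = y[71:100])

rmse <- function(model, data) sqrt(mean((predict(model, data) - data$y)^2))

simple  <- lm(y ~ poly(x, 2),  data = train)   # high bias, low variance
complex <- lm(y ~ poly(x, 15), data = train)   # low bias, high variance

# Training error: the complex model necessarily does at least as well.
c(rmse(simple, train), rmse(complex, train))
# Test error: here the flexible model can pay for chasing the noise.
c(rmse(simple, test), rmse(complex, test))
```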

1.3 Stages in Machine Learning

There are a few important steps that should be taken before you even think about training a Machine Learning model:

  • Identify the problem to be solved.
  • Define the appropriate target feature.
  • Specify how the model will be assessed.
  • Assemble the required data.

1.4 What Can Machine Learning Do?

There are two main classes of Machine Learning techniques:

  • Supervised Learning and
  • Unsupervised Learning.

In Supervised Learning each observation has been assigned a label (either numerical or categorical) and it’s the job of the algorithm to learn how to predict that label. The type of the label leads to two further subclasses of model: Regression and Classification, where a regression model predicts a numerical label while a classification model predicts a categorical label.

There are no labels assigned in Unsupervised Learning. The algorithm itself discovers the similarities within different subsets of the data.
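Clustering with k-means illustrates this: the algorithm is given unlabelled points and discovers the groups on its own (a sketch on simulated data; choosing two clusters is an assumption that in practice would itself need to be justified).

```r
set.seed(13)

# Two well-separated but unlabelled groups of points.
points <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2)
)

# k-means recovers the groups without ever seeing a label.
clusters <- kmeans(points, centers = 2)

table(clusters$cluster)
```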

Below are some of the questions that can be addressed using Machine Learning:

1.4.1 Is this an example of Class A or Class B?

The problem of classification involves assigning items to one of two or more classes. This is the most common type of machine learning problem. The classes are always discrete, mutually exclusive and generally exhaustive. For example:

  • True or False;
  • Yes or No;
  • Smoking or Non-Smoking;
  • New or Returning customer;
  • negative, neutral or positive sentiment; or
  • S, M, L, XL.

1.4.2 How many? How much?

A regression model will predict a numerical quantity. For example,

  • What will the temperature be tomorrow?
  • How many visitors will my web site receive this year?
  • How long will it take students to solve a problem?
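A minimal regression example, in the spirit of the first question above (simulated data; `lm` is just the simplest choice of regression model, and the linear relationship is an assumption baked into the simulation).

```r
set.seed(13)

# Simulated relationship: temperature rises linearly with humidity,
# plus noise.
humidity <- runif(100, 0, 100)
temperature <- 20 + 0.1 * humidity + rnorm(100)

model <- lm(temperature ~ humidity)

# Predict a numerical quantity for a new observation.
predict(model, data.frame(humidity = 50))
```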

1.4.3 Is this Unusual?

Anomaly detection algorithms aim to identify items that are out of the ordinary. This might superficially seem like a two class classification problem. However, the difference is that with classification you would generally have a number of examples of each class to train the algorithm on. With anomaly detection you don’t. You should have an idea of what “normal” data look like. You’ll need to quantify the difference from “normal” and decide on how far from “normal” would be considered anomalous.
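A minimal sketch of this idea: characterise “normal” by a mean and standard deviation, then flag anything too many standard deviations away (the threshold of 3 is a common but arbitrary choice, and real anomaly detection methods are usually more robust than this).

```r
set.seed(13)

# Mostly "normal" data with two injected anomalies at the end.
values <- c(rnorm(100, mean = 10, sd = 1), 25, -5)

# Quantify distance from "normal" as an absolute z-score.
z <- abs(values - mean(values)) / sd(values)

# Flag anything more than 3 standard deviations from the mean.
which(z > 3)
```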

1.4.4 What is the structure? How is it organised?

All of the previous questions can be addressed using supervised learning techniques. However, when trying to uncover structure in a data set there is generally no prior information assigning items to classes or even indicating how many classes there should be. Unsupervised learning algorithms attempt to find natural structure in the data.

If supervised learning is picking out planets from among the stars in the night sky, then clustering is inventing constellations.

Brandon Rohrer

1.4.5 Can these data be simplified?

Dimensionality reduction techniques take advantage of redundancy in the data to remove or consolidate features.
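Principal Component Analysis (PCA) is the classic example: it rotates correlated features into uncorrelated components, after which trailing components can often be dropped with little loss of information (a sketch using the built-in iris measurements; scaling the features first is a common but not universal choice).

```r
# PCA on the four numerical measurements in the built-in iris data.
pca <- prcomp(iris[, 1:4], scale. = TRUE)

# Proportion of total variance captured by each component: the first
# two components carry most of the information in the four features.
summary(pca)$importance["Proportion of Variance", ]
```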

1.4.6 What is the Best Next Action?

Reinforcement learning techniques learn the optimal response to a given set of circumstances based on feedback. These techniques are appropriate for automated systems which must operate autonomously. One of the attractive characteristics of reinforcement learning is that it can get started with little or no data: it learns as it goes.
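A toy sketch of the learning-as-it-goes idea: an epsilon-greedy agent choosing between two actions with unknown reward probabilities (the action probabilities, exploration rate and horizon are all arbitrary choices, and real reinforcement learning problems also involve states and delayed rewards).

```r
set.seed(13)

# True success probabilities of the two actions (unknown to the agent).
p <- c(0.3, 0.7)

value <- c(0, 0)   # Estimated value of each action.
count <- c(0, 0)   # Number of times each action has been tried.

for (i in 1:1000) {
  # Explore a random action with probability 0.1, otherwise exploit
  # the action currently believed to be best.
  action <- if (runif(1) < 0.1) sample(2, 1) else which.max(value)
  reward <- rbinom(1, 1, p[action])
  # Update the running average reward for the chosen action.
  count[action] <- count[action] + 1
  value[action] <- value[action] + (reward - value[action]) / count[action]
}

# The agent's estimates converge towards the true probabilities.
value
```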

1.5 Machine Learning Continuum

The objectives for a Machine Learning project generally lie somewhere on a continuum between hindsight (understanding the past) and foresight (predicting or affecting the future). There are four discrete stages:

  • Descriptive Analytics: delving into the data to figure out what has happened;
  • Diagnostic Analytics: formulating hypotheses for why it happened;
  • Predictive Analytics: predicting what will happen in the future; and
  • Prescriptive Analytics: figuring out how to actually make things happen in the future.

1.6 Not a Silver Bullet

Machine Learning techniques are able to achieve amazing results. Unfortunately they are not a silver bullet. Part of the challenge is choosing the right algorithm to apply to a given set of data. Consider the classification problem below which has been attacked using a variety of algorithms. The boundary between the two classes of points is clearly not linear, so a linear technique like Logistic Regression fails miserably. A Decision Tree is slightly better, at least picking out some of the structure in the data. A Random Forest (loosely speaking, a more sophisticated version of the Decision Tree), manages to at least separate the two classes of points although the decision boundary does not seem to represent the spirit of the data. Finally, the fully non-linear SVM model appears to completely grok the data. This illustrates the case where a more complex model is required in order to fully capture the features in the data.

These models are probably going to seem rather foreign right now. Don’t worry. We’ll be looking at each of them in detail later. For the moment all you need to know is that they represent increasing levels of complexity (or flexibility), from Logistic Regression (the simplest), through Decision Trees and Random Forests, to SVM (the most complex).

Let’s look at a different set of data. Here the difference between classes is clear, so we might expect the Logistic Regression model to perform somewhat better. However, since it can only produce a single linear decision boundary, there is no way for it to fully capture the data. Of the remaining three models, the Decision Tree and Random Forest both perform well. Of these, the Decision Tree is preferred since it is the simpler model. The flexible nature of the SVM leads it to overfit the underlying data.

1.7 No Free Lunch

There are a great number of ML algorithms available and new ones are being developed all the time. Some algorithms are better suited to a particular problem than others. Is there one particular algorithm that is universally (and fundamentally) better than all of the others for all possible problems? David Hume, an eighteenth century Scottish philosopher, made some observations which are pertinent to this question.

Thus not only our reason fails us in the discovery of the ultimate connexion of causes and effects, but even after experience has informed us of their constant conjunction, it is impossible for us to satisfy ourselves by our reason, why we should extend that experience beyond those particular instances, which have fallen under our observation. We suppose, but are never able to prove, that there must be a resemblance betwixt those objects, of which we have had experience, and those which lie beyond the reach of our discovery. David Hume, A Treatise of Human Nature

Okay, that’s pretty intense. Let’s look at another quote that’s a little easier to digest.

…even after the observation of the frequent or constant conjunction of objects, we have no reason to draw any inference concerning any object beyond those of which we have had experience… David Hume, A Treatise of Human Nature

Hmmmm. The waters are still pretty muddy. Let’s drag ourselves to the end of the twentieth century and see if things get clearer. David Wolpert and William Macready formulated the “No Free Lunch” Theorem which essentially states that no one algorithm is optimal for every problem. The assumptions and characteristics of an algorithm which make it ideally suited to one set of data may make it completely inappropriate in another. As a result it is common practice to assess the performance of multiple models.

2 Exercises

Resources

Domingos, Pedro. 2012. “A Few Useful Things to Know about Machine Learning.” Communications of the Association for Computing Machinery 55 (10): 78–87. doi:10.1145/2347736.2347755.

Kuhn, Max. 2008. “Building Predictive Models in R Using the caret Package.” Journal of Statistical Software 28 (5): 1–26. doi:10.18637/jss.v028.i05.

Wolpert, David H. 1992. “Stacked Generalization.” Neural Networks 5 (2): 241–59. doi:10.1016/S0893-6080(05)80023-1.